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Abstract We present a highly general implementation of fast multipole meth- 
ods on graphics processing units (GPUs). Our two-dimensional double preci- 
sion code features an asymmetric type of adaptive space discretization leading 
to a particularly elegant and flexible implementation. All steps of the mul- 
tipole algorithm are efficiently performed on the GPU, including the initial 
phase which assembles the topological information of the input data. Through 
careful timing experiments we investigate the effects of the various peculiarities 
of the GPU architecture. 
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1 Introduction 

We discuss in this paper implementation and performance issues for adaptive 
fast multipole methods (FMMs). Our concerns are focused on using mod- 
ern high-throughput graphics processing units (GPUs) which have seen an 
increased popularity in Scientific Computing in recent years. This is mainly 
thanks to their high peak floating point performance and memory bandwidth, 
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implying a theoretical performance which is an order of magnitude better than 
for CPUs (or even more) . However, in practice for problems in Scientific Com- 
puting, the floating point peak performance can be difficult to realize since 
many such problems are bandwidth limited [24]. Although the GPU proces- 
sor bandwidth is up to 4 times larger than that of the CPU, this is clearly 
not sufficient whenever the parallel speedup is (or could be) much larger than 
this. Moreover, according to the GPU computational model, the threads has 
to be run synchronously in a highly stringent manner. For these reasons, near 
optimal performance can generally only be expected for algorithms of predom- 
inantly data-parallel character. 

Another difficulty with many algorithms in Scientific Computing is that 
the GPU off-chip bandwidth is comparably small such that the ability to 
mask this communication becomes very important [24]. Since the traditional 
form of many algorithms often involves intermediate steps for which the GPU 
architecture is sub-optimal, a fair degree of rethinking is usually necessary to 
obtain an efficient implementation. 

Fast multipole methods appeared first in [4, 9] and have remained impor- 
tant computational tools for evaluating pairwise interactions of the type 

N 

#(xi)= J2 G ( x i> x j)> ^eIl D , i = l...N, (l.l) 

where D € {2, 3}. More generally, one may consider to evaluate 

N 

<%;)= Yl G (yi> x j)> i = l--.M, (1.2) 

where {y{\ is a a set of evaluation points and {xj} a set of source points. In this 
paper we shall also conveniently use the terms potentials or simply particles 
to denote the set of sources {xj}. 

Although the direct evaluation of (1.1) has a complexity of O (N 2 ^, the 
task is trivially parallelizable and can be performed much more efficiently us- 
ing GPUs than CPUs. For sufficiently large N, however, tree-based codes in 
general and the FMM algorithm in particular become important alternatives. 
The practical complexity of FMMs scales linearly with the input data and, 
moreover, effective a priori error estimates are available. Parallel implemen- 
tations are, however, often highly complicated and balancing efficiency with 
software complexity is not so straightforward [20, 23]. 

In this paper we present a double precision GPU implementation of the 
FMM algorithm which is fully adaptive. Although adaptivity implies a more 
complex algorithm, this feature is critical in many important applications. 
Moreover, in our approach, all steps of the FMM algorithm are performed on 
the GPU, thereby reducing memory traffic to the host CPU. 

Successful implementations of the FMM algorithm for GPUs have been 
reported previously [12, 26, 27] under certain limitations. Specifically, with 
GPUs the performance of single precision algorithms is a factor of at least 2 
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times better than double precision [16, p. 11]. In fact, for computationally in- 
tensive applications, this factor can reach as high as 8 times [25], which implies 
that single precision speedups vis-a-vis CPU implementations can well be > 2. 
It should be noted, however, that simpler tree-based methods than the FMM 
exist that offer a better performance at low tolerance (cf. [3], [10, Chap. 8.7]), 
and that the FMM is of interest mainly for higher accuracy demands. 

In Section 2 we give an overview of our version of the adaptive FMM. The 
details of the GPU implementation are found in Section 3 and in a separate 
Section 4 we highlight the algorithmic changes that were made to the original 
serial code described in [7]. In Section 5 we examine in detail the speed-ups 
obtained when moving the various phases of the algorithm from the CPU to 
the GPU. We also reason about the results such that our findings may benefit 
others who try to port their codes to the GPU. Since the FMM has been judged 
to be one of the top 10 most important algorithms of the 20th century [5], it 
is our hope that insights obtained here is of general value. A final concluding 
discussion around these matters is found in Section 6. 

Availability of software The code discussed in the paper is publicly available 
and the performance experiments reported here can be repeated through the 
Matlab-scripts we distribute. Refer to Section 6.1 for details. 

2 Well-separated sets and adaptive multipole algorithms 

In a nutshell, the FMM algorithm is a tree-based algorithm which produces a 
continuous representation of the potential field (1.2) from all source points in a 
finite domain. Initially, all potentials are placed in a single enclosing box at the 
zeroth level in the tree. The boxes are then successively split into child-boxes 
such that the number of points per box decreases with each level. 

The version of FMM that we consider is described in [7] and is organized 
around the requirement that boxes at the same level in the tree are either 
decoupled or strongly /weakly coupled. The type of coupling between the boxes 
follows from the 0-criterion, which states that for two boxes with radii r\ and 
r 2 , whose centers are separated with distance d, the boxes are well-separated 
whenever 

R + 6r<9d, (2.1) 

where R = max{ri,r 2 }, r = min{ri,r 2 }, and 9 G (0, 1) a parameter. In this 
paper we use the constant value 6 = 1/2 which we have found to perform well 
in practice. At each level / and for each box b, the set S(p) of strongly coupled 
boxes of its parent box p is examined; children of S(p) that satisfy the 6- 
criterion with respect to b are allowed to become weakly coupled to b, otherwise 
they remain strongly coupled. Since a box is defined to be strongly connected 
to itself this rule defines the connectivity for the whole multipole tree. In 
Figure 2.1 an example of a multipole mesh and its associated connectivity 
pattern is displayed. 



4 



Anders Goude, Stefan Engblom 



All boxes are equipped with an outgoing multipole expansion and an in- 
coming local expansion. The multipole expansion is the expansion of all sources 
within the box around its center and is valid away from the box. To be con- 
crete, in our two-dimensional implementation we use the traditional format of 
a p-term expansion in the complex plane, 

M(z) = a a \og(z-z )+j2 ( w ( 2 - 2 ) 

where zq is the center of the box. The local expansion is instead the expansion 
of sources far away from the box and can therefore be used to evaluate the 
contribution from these sources at all points within the box: 

L(z) = f2bj(z-z y, (2.3) 

where again (2.3) is specific to a two-dimensional implementation. 

The computational part of the FMM algorithm proceeds in an upward and 
a downward phase. During the first stage the multipole-to-multipole (M2M) 
shift from children to parent boxes recursively propagates and accumulates 
multipole expansions. In the second stage the important multipole-to-local 
(M2L) shift adds to the local expansions in all weakly coupled boxes which arc 
then propagated downwards to children through the local-to-local shift (L2L). 
At the finest level, any remaining strong connections are evaluated through 
direct evaluation of (1.1) or (1.2). A simple optimization which was noted al- 
ready in [4] is that, at the lowest level, strongly coupled boxes are checked with 
respect to the ^-criterion (2.1), but with the roles of r and R interchanged. If 
found to be true, then the potentials in the larger box can be directly shifted 
into a local expansion in the smaller box, and the outgoing multipole expansion 
from the smaller can directly be evaluated within the larger box. 

The algorithm so far has been described without particular assumptions 
on the multipole mesh itself. As noted in [7], relying on asymmetric adaptiv- 
ity when constructing the meshes makes a very convenient implementation 
possible. In particular, this construction avoids complicated cross-level com- 
munication and implies that the multipole tree is balanced, rendering the use of 
post-balancing algorithms [21] unnecessary. Also, the possibility to use a static 
layout of memory is particularly attractive when considering data-parallel im- 
plementations for which the benefits with adaptive meshes have been ques- 
tioned [12]. 

In this scheme, the child boxes are created by successively splitting the 
parent boxes close to the median value of the particle positions, causing all 
child boxes to have about the same number of particles. At each level, all 
boxes are split twice in succession, thus producing four times as many boxes 
for each new level. The resulting FMM tree is a pyramid rather than a general 
tree for which the depth may vary. The cost is that with a balanced tree, 
the communication stencil is variable. Additionally, it also prevents the use 
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Fig. 2.1 The adaptive mesh is constructed by recursively splitting boxes along the coordi- 
nate axes in such a way that the number of source points is very nearly the same in the four 
resulting boxes, (a) Here the boxes colored in light gray will interact via multipole-to-local 
shifts with the black box, that is, they satisfy the (^-criterion (8 = 1/2). The boxes in dark 
gray are strongly connected to the black box and must be taken care of at the next level 
in the tree, (b) Same mesh as in (a), but visualized as a distribution by letting the height 
of each box be inversely proportional to its area. The source points in this example were 
sampled from a normal distribution. 



of certain symmetries in the multipole translations as described in [14]. To 
improve on the communication locality, the direction of the split is guided by 
the eccentricity of the box since the algorithm gains in efficiency when the 
boxes have equal width and height (the ^-criterion is rotationally invariant). 

The algorithmic complexity of the FMM has been discussed by many au- 
thors. Practical experiences [2], [10, Chap. 8.7], [11, Chap. 6.6.3], indicate that 
linear complexity in the number of source points is observed in most cases, 
but that simpler algorithms perform better in certain situations. Although it 
is possible to construct explicit distributions of points for which the FMM 
algorithm has a quadratic complexity [1], this behavior is usually not observed 
in practical applications. 

With p terms used in both the multipole and the local expansions, we 
expect the serial computational complexity of our implementation to be pro- 
portional to 9~ 2 p 2 ■ N, with N the number of source points. This follows from 
assuming an asymptotically regular mesh such that R ~ r in (2.1) and a total 
of on the order of N boxes at the finest level. Then each of those N boxes 
interact through M2L-interactions with about ird 2 x TV other boxes. From (2.1) 
we get d <~ (l + 6)/6 xr~ (y/NO)' 1 , and since the M2L-interaction is a linear 
mapping between p coefficients this explains our stated estimate. This simple 
derivation assumes that the M2L-shift is the most expensive part of the algo- 
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rithm. In practice, the cost of the direct evaluation of (1.1) may well match 
this part. From extensive experiments with the serial version of the algorithm, 
we have seen that it is usually possible to balance the algorithm in such a way 
that these two parts take roughly the same time. 

With a given relative target tolerance TOL, the analysis in [7] implies 
p ~ log TOL / log 6, so that the total complexity can be expected to be on the 
order of 6»~ 2 log -2 0-N log 2 TOL. We now proceed to discuss an implementation 
which distributes this work very efficiently on a GPU architecture. 

3 GPU Implementation 

The first major part of the adaptive fast multipole method is the topological 
phase which arranges the input into a hierarchical FMM mesh and determines 
the type of interactions to be performed. We discuss this part in Section 3.2. 
The second part is discussed in Section 3.3 and consists of the actual multipole 
evaluation algorithm with its upward and downward phases performing the 
required interactions in a systematic fashion. 

To motivate some of our design choices, we choose to start with a brief 
discussion on the GPU hardware and the CUDA programming model ( Com- 
pute Unified Device Architecture). The interested reader is referred to [8, 17] 
for further details. 

3.1 Overview of the GPU architecture 

Using CUDA terminology, the GPU consists of several multiprocessors (14 for 
the Tesla C2075), where each multiprocessor contains many CUDA cores (32 
for the C2075). The CUDA execution model groups 32 threads together into a 
warp. All threads in the same warp execute the same instruction, but operate 
on different data; this is simply the GPU-version of the Single Instruction 
Multiple Data (SIMD)-model for parallel execution. Whenever two threads 
within the same warp need to execute different instructions, the operations 
become serialized. 

To perform a calculation on the GPU, the CPU launches a kernel contain- 
ing the code for the actual calculation. The threads on the GPU are grouped 
together into blocks, where the number of blocks as well as the number of 
threads per block is to be specified at each kernel launch (for performance rea- 
sons the number of threads per block is usually a multiple of the warp size). All 
threads within a block will be executed on the same multiprocessor and thread 
synchronization can only be performed efficiently between the threads of a sin- 
gle block. This synchronization is required whenever two threads write to the 
same memory address. Although for this purpose there are certain built-in 
atomic write operations, none of these support double precision for the Tesla 
C2075. 

The GPU typically has access to its own global memory and all data has 
to be copied to this memory before being used. Further, each multiprocessor 
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has a special fast shared memory, which can be used for the communication 
between threads within the same block [8, Chap. 5]. For example, the C2075 
has 48 kB of shared memory per multiprocessor for this purpose. Overuse of 
this memory limits the number of threads that can run simultaneously on a 
multiprocessor, which in turn has a negative effect on performance. 

3.2 Topological phase 

The topological phase consists of two parts, where the first part creates the 
boxes by partitioning the particles (we refer to this as "sorting") and the 
second part determines the interactions between them ( "connecting"). 

The sorting algorithm successively partitions each box in two parts accord- 
ing to a chosen pivot point (Algorithm 3.1). The pivot element is obtained by 
first sorting 32 of the elements using a simple O (n 2 ) algorithm where each 
thread sorts a single point (32 elements was chosen to match the warp size). 
The pivot is then determined by interpolation of the current relative position 
in the active set of points so as to approximately land at the global median 
point (line 2, Algorithm 3.1). 

Algorithm 3.1 (Partitioning with successive splits) 
Input: Unordered array consisting of x- or y-coordinates. 
Output: Array partitioned around its median coordinate. 
1: while size (array) > 32 do 
2: determine jpivot-32() 
3: spliLaround-pivot() 
4: keep-part-Containing-median() 
5: end while 

6: {the array now consists of < 32 elements:} 
7: determine_median_32() 

The split in line 3 uses a two-pass scheme, where each thread handles a 
small set of points to split. The threads start by counting the number of points 
smaller than the pivot. Then a global cumulative sum has to be calculated. 
Within a block, the method described in [13] is used. For multiple blocks, 
atomic addition is used in between the blocks, thus allowing the split to be 
performed in a single kernel call (note that using atomic addition makes the 
code non-deterministic) . Given the final cumulative sum, the threads can cor- 
rectly insert their elements in the sorted array (second pass). After the split, 
only the part containing the median is kept for the next pass (line 4). 

Algorithm 3.1 can be used to partition many boxes in a single kernel call 
using one block per box. If the number of boxes is low, it is desirable to use 
several blocks for each partitioning to better use the GPU cores. This requires 
communication between the blocks and the partitioning has to be performed 
with several kernel calls according to Algorithm 3.2. The splitting code is 
executed in a loop (lines 2 to 9) and a small amount of data transfer between 
the GPU and the CPU is required to determine the number of loops. 
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Algorithm 3.2 (Partitioning with successive splits, CPU part) 
Input: Unordered array consisting of x- or y-coordinates. 
Output: Array partitioned around its median coordinate. 

l: determinesplit-direction() 

2: while maxi size(arrayi) > single JhreadJimit do 
3: { executed in parallel:} 
4: for all splits do 
5: determine-pivoL32() 
6: partition_around-pivot() 
7: keep-part-Containing-median() 
8: end for 
9: end while 
10: split_on_singleJ)lock() 

Running the code using multiple blocks forces the code to run synchro- 
nized, with equal amount of splits in each partitioning. If one bad pivot is 
encountered, then this split takes much longer time than the others resulting 
in bad parallel efficiency. By contrast, a single block running does not force 
this synchronization and, additionally, allows for a better caching of the ele- 
ments since they remain in the same kernel call all the time. For these reasons, 
Algorithm 3.2 switches to single block mode (line 10) when all splits contain 
a small enough number of points (single JhreadJimit = 4096 in the current 
implementation) . 

The second part of the topological phase determines the connectivity of 
the FMM mesh, that is, if the boxes should interact via near- or far-field 
interactions. This search is performed for each box independently and the 
parallelization one thread/box is used here. Each kernel call calculates the 
interactions of one full level of the FMM tree. 



3.3 Computational part: the multipole algorithm 

The computational part consists of all the multipole-related interactions, which 
include initialization (P2M), shift operators (M2M), (M2L), and (L2L), and 
local evaluation (L2P). Additionally, we also include the direct interaction in 
the near-field (P2P) in this floating point intensive phase of the FMM algo- 
rithm. During the computational part, no data transfer is necessary between 
the GPU and the host. 

3.3.1 Multipole initialization 

The initialization phase creates multipole expansions for each box via particle- 
to-multipole shifts (P2M). Since each source point gives a contribution to each 
coefficient aj, using several threads per box requires intra-thread communi- 
cation which in Algorithm 3.3 is accomplished by introducing a temporary 
matrix to store the coefficients. 
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Initially (line 5, Algorithm 3.3), one thread calculates the coefficients for 
one source particle (two threads if the number of particles is less than half 
the number of available threads). Then each thread calculates the sum for 
one coefficient (line 7). This procedure has to be repeated for a large number 
of coefficients, as it is desirable to have a small temporary matrix to limit 
the use of shared memory. The current implementation uses 64 threads per 
box (two warps) and takes 4 coefficients in each loop iteration (8 in the two 
threads / particle case) . 

The initialization also handles the special case where the particles are con- 
verted directly to local expansions via particle-to-local expansions (P2L). The 
principle for creating local expansions is the same as for the multipole expan- 
sions. All timings of this phase will include both P2M and P2L shifts. 

Algorithm 3.3 (Multipole initialization) 

Input: Positions and strengths for source particles in a box. 
Output: Multipole coefficients en for the box. 
1: {executed in parallel:} 
2: for all sources in box do 
3: load_one_source-per-thread() 
4: for k = 1 to p do 

5: temp_array := calc_N_cache .coefficients () 

6: synchronize_threads() 

7: ak •'= afe + sum (temp .array) 

8: end for 

9: end for 

3.3.2 Upward pass 

In the upward pass, the coefficients of each of four child boxes are repeatedly 
shifted to their parent's coefficients via the M2M-shift. This is achieved using 
Algorithm 3.4 (a) which is similar to the one proposed in [15]. Algorithm 3.4 (a) 
can be parallelized by allowing one thread to calculate one interaction, and at 
the end compute the sum over the four boxes (line 14, Algorithm 3.4). It should 
be noted that the multiplication in Algorithm 3.4 is a complex multiplication 
which is performed O (p 2 ) times. By introducing scaling, the algorithm can be 
modified to Algorithm 3.4 (b), which instead requires one complex division, 
O (p) complex multiplications and O (p 2 ) complex additions. The advantage 
of this modification in the GPU case is not the reduction of complex multipli- 
cations, but rather that the real and imaginary parts are independent. Hence 
two threads per shift can be used thus reducing the amount of shared memory 
per thread. 

Algorithm 3.4 (Multipole to multipole translation) 
Input: Multipole coefficients aj of child box at position z c . 
Output: Multipole coefficients aj of parent box at position z p . 
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(a) Without scaling (b) With scaling 
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a := sum_translations() 


a := sum_translations() 



3. 3. 3 Downward pass 

The downward pass consists of two parts, the translation of multipole expan- 
sions to local expansions (M2L), and the translation of local expansions to the 
children of a box (L2L). The translation of local expansions to the children is 
very similar to the M2M-shift discussed previously, and can be achieved with 
the scheme in Algorithm 3.5 (b). This shift is slightly simpler on the GPU 
since there is no need to sum the coefficients at the end, but instead requires 
more memory accesses as the calculated local coefficients must be added to 
already existing values. 

Algorithm 3.5 (Local to local translation) 

Input: Local coefficients bj of parent box at position z p . 
Output: Local coefficients bj of child box at position z c . 
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The translation of multipole expansions to local expansions is the most time 
consuming part of the downward pass. The individual shifts can be performed 
with a combination of the reduction scheme in the M2M translation and the 
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L2L translation, see Algorithm 3.6. Again, this implementation allows for two 
dedicated threads for each shift. We have not seen this algorithm described 
elsewhere. 

Algorithm 3.6 (Multipole to local translation) 
Input: Multipole coefficients aj of box at position z% . 
Output: Local coefficients bj of box at position z a . 

1: r := z a - zi 

2: for j = 1 to p do 

3: := aj /ri ■ (-iy 

4: end for 

5: b p := 

6: for k = 2 to p do 
7: for j = p — k to p — 1 do 
8: b 3 := bj - bj i 
9: end for 
10: end for 

11: for k — p downto 1 do 
12: for j = k to p do 

13: bj := bj + bj-i 

14: end for 

15: end for 

16: b := b a -a - log(r) 

17: for j — 1 to p do 

18: bj := {bj - a /j)/ri 

19: end for 

20: b := sum-translations () 

With the chosen adaptive scheme, the number of shifts per box varies. Since 
our GPU does not support atomic addition in double precision, one block has 
to handle all shifts of one box in order to perform this operation in one kernel 
call. As the number of translations is not always a multiple of 16 (the number 
of translations per loop if 32 threads/block is used), this can result in idle 
threads. One can partially compensate for this by giving one block the ability 
to operate on two boxes in the same loop. 

As the M2L translations are performed within a single level of the multipole 
tree, all translations can be performed in a single kernel call. This is in contrast 
to the M2M- and L2L translations, which both require one kernel call per level. 

3.3.4 Local evaluation 

The local evaluation (L2P) is scheduled in parallel by using one block to cal- 
culate the interactions of one box. Moreover, one thread calculates the in- 
teractions of one evaluation point from the local coefficients of the box. This 
operation requires no thread communication and can be performed in the same 
way as for the CPU. The local evaluation uses 64 threads/block. 
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This phase will also include the special case where the evaluation is per- 
formed directly through the multipole expansion (M2P). This operation is 
performed in a similar way as for the L2P evaluation. All timings of this phase 
therefore include both L2P- and M2P evaluations. 

3.3.5 Near- field evaluation 

In the near-field evaluation (P2P), the contribution F from all boxes within 
the near-field of a box should be calculated at all evaluation points of the box. 
Similar to the M2L translations, the number of boxes in the near-field varies 
due to the adaptivity. 

Algorithm 3.7 (Direct evaluation between boxes) 

Input: Positions and strengths of particles in the near field. 
Output: The contribution F of the particles in the near field. 

1: {executed in parallel:} 

2: load_evaluation_point_positions() {one per thread} 
3: for all interaction boxes do 
4: cacheAnteractionjpositionsQ 
5: if cache-is -full or all -positions -loaded then 
6: for all elements in cache do 
7: F := F + add_pairwiseJnteraction() 

8: end for 
9: end if 
10: end for 

The GPU implementation uses one block per box and one to four threads 
per evaluation point (depending on the number of evaluation points compared 
to the number of threads). The interaction is calculated according to Algo- 
rithm 3.7, where the source points are loaded into a cache in shared memory 
(line 4) and when the cache is full, the interactions are calculated (line 7). This 
part uses 64 threads per block and a suitable cache size is to use the same size 
as the number of threads. 

4 Differences between the CPU and GPU implementations 

In this section we highlight the main differences between the two versions of 
the code. A point to note is that when we compare speed-up with respect 
to the CPU-code in Section 5, we have taken care in implementing several 
optimizations which are CPU-specific. 

4.1 Topological phase 

When sorting, the GPU implementation is based on sorting 32 element arrays 
for choosing a pivot element. This design was made to achieve a better use 
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of the CUDA cores and, in the multiple block/partitioning case, to make all 
partitionings behave more similar to each other. The single-threaded CPU 
version uses the median- of -three procedure for choosing the pivot element, 
which is often used in the well-known algorithm quicksort [19, Chap. 9]. An 
advantage with this method compared to the GPU algorithm is that it is 
in-place and hence that there is no need for temporary storage. 

4.2 Computational part 

The direct evaluation can use symmetry on the CPU if the evaluation points 
are the same as the source points since the direct interaction is symmetric. 
Using this symmetry, the run time of the direct interaction can be reduced by 
almost a factor of two in the CPU version. This is not implemented on the 
GPU as it would require atomic memory operations to store the results (which 
is not available for double precision on our GPU). 

The operations M2M, P2P, P2M, and M2P are all in principle the same in 
both versions of the code. For the M2L shift, the symmetry of the ^-criterion 
(2.1) can be used in the scaling phases on the CPU (lines 2-4 and 17-19 
in Algorithm 3.6), while in the GPU version, the two shifts are handled by 
different blocks, making communication unpractical. Another difference is that 
the scaling vector is saved in memory from the pre-scaling part to the post- 
scaling part after the principal shift operator. This was intentionally omitted 
from the GPU version as the extra use of shared memory decreased the number 
of active blocks on each multiprocessor, and this in turn reduced performance. 

4.3 Symmetry and connectivity 

The CPU implementation uses symmetry throughout the multipole algorithm. 
With symmetry, it is only required to create one-directional interaction lists 
when determining the connectivity. 

As the GPU implementation does not rely on symmetry when evaluating, 
it is beneficial to create directed interaction lists. This causes twice the work 
and twice the memory usage (for the connectivity lists). However, the time 
required to determine the connectivity is quite small (~ 1%, see Table 5.1). 

4.4 Further notes on the CPU implementation 

The current CPU implementation is a single threaded version. Other research 
[6, 24] has shown that good parallel efficiency can be obtained for a multicore 
version (e.g. 85% parallel efficiency on 64 processors). However, the work in 
[6] does not appear to use the symmetry in the compared direct evaluation. 

In order to achieve a highly efficient CPU implementation, as suggested 
also by others [22, 27], the multipole evaluation part was written using SSE 
intrinsics. Using these constructs means that two double precision operations 
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can be performed as one single operation. The direct- and multipolc evaluation, 
as well as the multipole initialization all use this optimization, where two points 
are thus handled simultaneously. Additionally, all shift operators also rely on 
SSE intrinsics in their implementation. 



4.5 Double versus single precision 

The entire algorithm is written in double precision for both the CPU and the 
GPU. Using single precision would be significantly faster, both on the CPU (as 
SSE could operate on 4 floats instead of 2 doubles) and on the GPU (where 
the performance difference appears to vary depending on the mathematical 
expression [25]). It is likely that higher speedups could be obtained with a 
single precision code, but it would also seriously limit the applicability of the 
algorithm. With our approach identical accuracy is obtained from the two 
codes. 



5 Performance 

This section compares the CPU and GPU codes performance-wise. The sim- 
ulations were performed on a Linux Ubuntu 10.10-system with an Intel Xeon 
6c W3680 3.33 GHz CPU and a Nvidia Tesla C2075 GPU. The compilers were 
GCC 4.4.5 and CUDA 4.0. For all comparisons of individual parts, the sort- 
ing was performed on the CPU to ensure identical multipole trees (the GPU 
sorting algorithm is non-deterministic). All timings have been handled using 
the timing functionality of the GPU. The total time includes the time to copy 
data between the host and the GPU, while the time of the individual parts 
does not include this. All simulations were performed a large number of times 
such that the error in the mean when regarded as an estimator to the expected 
value was negligible. Overall we found that the measured times displayed a 
surprisingly small spread with usually a standard deviation which was only 
some small fraction of the measured value itself. 

All performance experiments were conducted using the harmonic potential, 

Gizuzj)^-^-, (5.1) 

Zj Z; L 

in (1.1) and hence a = in (2.2). Moreover, in Sections 5.1 through 5.3, 
all simulations were performed using randomly chosen source points, homoge- 
neously distributed in the unit square. 

We remark again that the performance experiments reported here can be 
repeated using the scripts distributed with the code itself, see Section 6.1. 
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5.1 Calibration 

From the perspective of performance, the most critical parameter is the num- 
ber of levels in the multipole tree. Adding one extra level increases the number 
of boxes at the finest level in the tree by four. Assuming that each box connects 
to approximately the same number of boxes at each level, the total number of 
pairwise interactions therefore decrease with a factor of about 4. The initial- 
ization and multipole evaluation require the same amount of operations, but 
will operate on an increasing number of boxes, thus increasing the memory 
accesses. For all shift operations, one additional level implies about a three- 
fold increase of the total number of interactions and the same applies for the 
determination of the connectivity information. For the sorting, each level re- 
quires about the same amount of work, but handling many small boxes easily 
causes additional overhead. 

It is expected that the CPU code will follow this scaling quite well, while for 
the GPU, where several threads should run synchronously, this is certainly not 
always the case. As an example, L2P operates by letting one thread handle the 
interaction of one source point, P2M can use up to two threads, and P2P can 
use up to four threads/point (all these use 64 threads/block). On the tested 
Tesla C2075 system, this means that the local evaluation of a box containing 
1 evaluation point takes the same amount of time as a box containing 64 
evaluation points (on a Geforce GTX 480 system, this only applies to up to 
32 evaluation points which is the warp size here). This shows the sensitivity 
of the GPU implementation with respect to the number of points in each box. 
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Fig. 5.2 Time as a function of the number of sources per box for the CPU and the 
GPU implementation, both normalized so that the fastest time = 1. 

In Figure 5.1, the GPU speedup as a function of the number of sources 
per box is studied. Within this range, the shift operators and the connectivity 
mainly depend on the number of levels and therefore obtain constant speedups 
(hence we omit them). All parts that depend on the number of particles in 
each box obtain higher speedups for larger number of particles per box. This 
is expected, since it is easier to get a good GPU load for larger systems. There 
is also a performance decrease when the number of particles increases above 
32, 64, and so on, that is, at multiples of the warp size. The direct evaluation 
additionally shows a performance decrease directly after each multiple of 16, 
which is due to the fact that the algorithm can use 4 threads per particle. 

The small high frequency oscillations seen in the speedup of L2P and P2P 
originates from the CPU algorithm, and is due to the use of SSE instructions 
which makes the CPU code more efficient for an even number of sources per 
box. 

It should be noted again that the direct evaluation and connectivity both 
make use of symmetry in the CPU version. This means that the speedup would 
be significantly higher (almost a factor 2) if the CPU version did not rely on 
this symmetry. 

As adding one extra level reduces the computational time of the direct 
interaction, but increases the time requirement for most other parts, it is 
necessary to find the best range of particles per box. The number of levels Ni 
is calculated according to 




(5.2) 
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Part 


Time 


Part 


Time 


P2P 


43 % 


Connect 


1 % 


Sort 


30 % 


M2M 


< 1 % 


M2L 


11 % 


L2L 


< 1 % 


P2M 


5 % 






L2P 


2 % 


Other 


8 % 



Table 5.1 Time distribution of the GPU algorithm. The most expensive part in the case 
studied here is the direct evaluation (P2P), followed by sorting and M2L translations. The 
field "other" contains all data transfers between the host and the GPU. 



where N is the number of particles and N d is the desired number of particles 
per box. This parameter choice was studied for 150 simulations with different 
number of particles (from 1 x 10 4 to 2 x 10 6 ). The result (normalized against 
the lowest time on each platform) is found in Figure 5.2, showing that a value 
around 45 is best for the GPU, while 35 is best for the CPU. Even though the 
GPU has poor speedup for low number of particles, it still scales better than 
the CPU in this case. The reason is that with low values of Nd, the multipole 
shifts dominate the computational time. This simulation was performed with 
17 multipole coefficients, giving a tolerance of approximately 1 x 10 -6 . The 
tolerance is here and below understood as 

(5.3) 

oo 

where & exact is the exact potential and <Pfmm is the FMM result. 

For the optimal value 45 of Nd, the time distribution of the different parts 
of the algorithm is given in Table 5.1 for N — 45 x 2 16 , which gives 45 sources in 
each box at the finest level of the FMM tree. According to (5.2), using Nd = 45 
gives 8 levels for N £ (18 x 2 16 ,72 x 2 16 ]. Within this interval, the time of 
P2P relative to the total time varies between 25% to 55%. It is particularly 
interesting to note in Table 5.1 that the sorting dominates by a factor of about 
3 over the usually very demanding M2L-operation. 



5.2 Shift operators 

The performance of the sorting and direct evaluation depends on the number 
of sources per box and the number of levels while the connectivity to a first 
approximation only depends on the number of levels. The rest of the operators 
also depend on the number of multipole coefficients (the number of multipole 
coefficients determine the error in the algorithm) . Multipole initialization and 
evaluation depends linearly on the number of coefficients, while the shift op- 
erators have two linear parts (pre- and post-scaling) and one quadratic part 
(the actual shift). In the GPU case, all accesses to global memory are in- 
cluded in the linear parts while all data is kept in shared memory during the 
shift. A higher number of coefficients increases the use of shared memory and 
fewer shift operations can therefore be performed in parallel. The speedup as 



TOL = 
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Fig. 5.3 Speedup as a function of the number of multipole coefficients p. 

a function of number of coefficients is plotted in Figure 5.3, where the simu- 
lation was performed on 1 6 particles with Nd = 45. The decrease in speedup 
due to lack of shared memory can be seen quite clearly, e.g. at 42 coefficients 
for the M2L-shift, where one block less (3 in total) can operate on the same 
multiprocessor . 

The difference in speedup for L2P at low number of coefficients is likely due 
to overhead, since these values stabilize at high enough number of coefficients. 

Considering that the time required for the shift operators increases with 
increasing number of coefficients, the optimal value for Nd changes as well. 
Figure 5.4 shows that the optimal number for Nd increases approximately 
linearly with increasing number of coefficients. 

5.3 Break-even 

If the number of sources is low enough, it may be faster to use a direct sum- 
mation instead of the fast multipole method. In Figure 5.5, the speed of the 
entire algorithm is compared with the speed of direct summation for both the 
CPU and the GPU implementation. The speedup of the GPU code increases 
with the number of particles since more source points provide a better load 
for the GPU. Looking at the direct summation times, the GPU scales linearly 
in the beginning where the number of working cores is still increasing, and 
later scales quadratically as the cores become fully occupied. Since a double 
sum is easily performed in parallel and is not memory bandwidth dominated 
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Fig. 5.4 Optimal value of as a function of the number of multipole coefficients p. 

the direct evaluation provides a good estimate of the maximum speedup that 
can be achieved with the GPU. Recall, however, that symmetry is used in the 
CPU implementation, which almost speeds up the calculation with a factor 
of 2. Figure 5.5 shows that it is more beneficial to use the FMM algorithm if 
the number of points exceeds about 3500 on the GPU. This result compares 
favorably with that reported by others [26]. For large N, the speedup of the 
direct interaction is higher than that of the FMM (15 compared to about 11, 
see Figure 5.6). Again, one should note that the CPU version uses symmetry 
here. For simulations where the source points and evaluation points are sepa- 
rate, the speedup is about 30 for the direct evaluation and 15 for the FMM. 
The lower increase in speedup for the FMM is due to the fact that only the 
P2P-evaluation of the algorithm uses this symmetry (compare Table 5.1). 

Comparing the individual parts (Figure 5.7), the M2L- and P2P-shifts 
quite rapidly obtain high speedups, while the sorting requires quite a large 
number of points. The poor values for M2M and L2L at low number of particles 
are due to the fact that few shifts are performed at the lower levels, causing 
many idle GPU cores. The situation is the same for the connectivity. As these 
algorithms have to be performed one level at a time, the low performance of 
the shifts high up in the multipole tree decreases the performance of the entire 
step. Consequently, the speedup increases with an increasing number of source 
points. The oscillating behavior of the multipole initialization and evaluation 
is related to the number of particles in each box (compare with Figure 5.1). 

The code has been tested both on the Tesla platform used for the above 
figures, and on a Geforce GTX 480 platform (which has 480 cores, compared 
to 448 for the Tesla card). The total run-time is approximately the same on 
both platforms. Notable differences are that the P2P interaction is faster on 
the Tesla if Nd is high, and in the simulation in Table 5.1, the GTX 480 
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Fig. 5.5 Total time of the algorithm as a function of the number of sources TV. For the 
FMM-algorithm, the simulation was performed with p = 17, implying a tolerance of about 
10" 6 . 




Fig. 5.6 Speedup as a function of the number of sources TV. 
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card required 30% longer time than the Tesla card. On the other hand, the 
Tesla card required 25% longer time than the GTX 480 for the sorting (which 
is limited by memory access, rather than double precision math). The shift 
operators were approximately equally fast on both systems. The overall result 
is that the optimal value for Nd is lower for the GTX 480 card (35 instead of 
45) for a total running time which was approximately the same. This shows 
that the much cheaper GTX 480 gaming card is a perfectly reasonable option 
for this implementation of the fast multipole method, despite the fact that it 
is written in double precision. 

5.4 Benefits of adaptivity 

As a final computational experiment we investigated the performance of the 
adaptivity by using different point distributions. Under a relative tolerance of 
10~ 6 (p = 17 in (2.2) and (2.3)) we measured the evaluation time for increasing 
number of points sampled from three very different distributions. As shown in 
Figure 5.8, the code is robust under highly non-uniform inputs and scales well 
at least up to some 10 million source points. 

When the distribution of particles is increasingly non-uniform, more boxes 
will be in each others near-field resulting in more direct interactions. This 
is tested in Figure 5.9 for the two non-uniform distributions from Figure 5.8. 
Both the CPU- and GPU timings have been normalized to the time of a homo- 
geneous distribution. The results indicate that the decrease in performance for 
highly non-uniform distributions is less for the GPU version than for the CPU 
version. From the timings of the individual parts, it is seen that almost all the 
increase in computational time originates from the P2P-shift. The speedup 
for all time consuming individual parts is relatively constant with respect to 
the degree of non-uniformity (e.g. the parameter a in Figure 5.9) and the rea- 
son the GPU code handles a highly non-uniform distribution better is simply 
because the P2P evaluation has a higher speedup than the overall code. 

6 Conclusions 

We have demonstrated that all parts of the adaptive fast multipole algorithm 
can be efficiently implemented on a GPU platform in double precision. Overall, 
we obtain a speedup of about a factor of 11 compared to a highly optimized 
(including SSE intrinsics), albeit serial, CPU-implementation. This factor can 
be compared with the speedup of about 15 which we obtain obtained for the 
direct A-body evaluation, a task for which GPUs are generally understood to 
be well suited [18] (see Figure 5.6). 

In our implementation, all parts of the algorithm achieve speedups in about 
the same order of magnitude. Generally, we obtain a higher speedup when- 
ever the boxes contain some 20-25% more particles than the CPU version 
(see Figures 5.4 and 5.7). Given the data-parallel signature of this particu- 
lar operation, this result is quite expected. Another noteworthy result is that 
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Fig. 5.7 Speedup of individual parts as a function of the number of sources. 

our version of the GPU FMM is faster than the direct TV-body evaluation at 
around N = 3500 source points, see Figure 5.5. Our tests also show that the 
asymmetric adaptivity works at least as well on the GPU as on the CPU, and 
that in some cases it even performs better. 

When it comes to coding complexity it is not so straightforward to present 
actual figures, but some observations at least deserve to be mentioned. The 
topological phase was by far the most difficult part to implement on the GPU. 
In fact, the number of lines of codes approximately quadrupled when transfer- 
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Fig. 5.8 Performance of the code when evaluating the harmonic potential for three different 
distributions of points. The source points were (%) uniformly distributed in [0,1] X [0,1], 
(ii) normally distributed with variance 1/100, and (Hi) distributed in a 'layer' where the x- 
coordinatc is uniform, and the ^/-coordinate is again -/V(0, l/100)-distributed. For the purpose 
of comparison, all distributions were rejected to fit exactly within the unit square. The FMM 
mesh for case (ii) is shown in Figure 2.1. 




Fig. 5.9 Robustness of adaptivity. Time for two different non-uniform distributions nor- 
malized with respect to a uniform distribution of points. The top two graphs are for the 
normal distribution of sources, while the lower two graphs are for the 'layer' distribution. 
See text for further details. 
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ring this operation. We remark also that the topological phase performs rather 
well on the GPU, an observation which can be attributed to its comparably 
high internal bandwidth. Thus, there is a performance/software complexity 
issue here, and striking the right balance is not easy. 

By contrast, the easiest part to transfer was the direct evaluation (P2P), 
where, due to SSE-intrinsics, the CPU-code is in fact about twice the size than 
the corresponding CUDA-implementation. 

These observations as well as our experimental results, suggest that a bal- 
anced implementation, where parts of the algorithm are off-loaded to the GPU 
while the remaining parts are parallelized over the CPU-cores, would be a rea- 
sonable compromise. This has also been noted by others [3] and is ongoing 
research. 



6.1 Reproducibility 

Our implementation as described in this paper is available for download via 
the second author's web-page 1 . The code compiles both in a serial CPU- 
version and in a GPU-specific version and comes with a convenient Matlab 
mex-interface. Along with the code, automatic Matlab-scripts that repeat the 
numerical experiments presented here are also distributed. 
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